Data Wrangling is the process of arranging your dataset into a workable format for analysis. When retrieving data from various sources, it is not always in a format that is ready to be analyzed. There are often missing or incorrect values in the dataset, which reduce the integrity of any analysis performed on it. Data Wrangling has been known to take up as much as 70% of a data scientist's time. In this notebook, we will review a few different formats in which you may retrieve data and some initial techniques to ensure that the data is ready for analysis.
One of the most common formats for datasets is the comma separated values, or csv, format. This format works well with both Microsoft Excel and Google Sheets, as both platforms allow you to view it as a spreadsheet. It is also supported directly by the pandas DataFrame. In the code below, we read a csv file into a DataFrame, df, and display the first 5 entries.
In [2]:
import numpy as np
import pandas as pd
df = pd.read_csv('Master.csv')
df.head(5)
Out[2]:
From the output, you can see that this dataset gives information about baseball players. Here are the last 10 entries of our DataFrame.
In [3]:
df.tail(10)
Out[3]:
One of the first things that you can do is analyze the DataFrame to see if anything looks incorrect, such as variation in the number of entries in each column, the min and max values, etc. This can be easily assessed through the describe() method.
In [4]:
df.describe()
Out[4]:
As we can see from this dataset, there is a lot of variation in the count across the columns. Some of this is due to the fact that there are baseball players in this dataset who are still alive today, so there is no information on their death date. Some of it can simply be due to incomplete information being provided. This is where Data Wrangling techniques must be applied so that we have a good dataset to work with.
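A quick way to quantify these gaps is isnull().sum(), which counts missing values per column. The sketch below uses a small made-up frame (the column names mirror Master.csv, but the rows are invented for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative frame with gaps, mimicking the Master.csv situation
# where living players have no death date.
players = pd.DataFrame({
    'playerID': ['aardsda01', 'aaronha01', 'abadan01'],
    'birthYear': [1981, 1934, 1972],
    'deathYear': [np.nan, np.nan, np.nan],
    'weight': [220.0, 180.0, np.nan],
})

missing = players.isnull().sum()  # missing values per column
print(missing)
```

Running the same call on df would show exactly which columns drive the count differences that describe() revealed.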
SQL queries can be used on DataFrames to extract the necessary data through SQLite syntax. The pandasql library provides the necessary API for running SQL queries. Below are a few examples. The first query retrieves the first 10 entries of the birthMonth and birthYear columns.
In [5]:
import pandasql
q = 'SELECT birthMonth, birthYear FROM df LIMIT 10'
sql_sol = pandasql.sqldf(q.lower(),globals())
sql_sol
Out[5]:
The query below retrieves data on players who weigh less than 200 lbs and groups the entries by birthCountry.
In [6]:
q2 = 'SELECT playerID,birthCountry,bats,weight FROM df WHERE weight < 200 GROUP BY birthCountry LIMIT 10'
sql_sol2 = pandasql.sqldf(q2.lower(),globals())
sql_sol2
Out[6]:
Using SQL queries makes it simple to retrieve the data that you would like to analyze, and then wrangle that data specifically if necessary.
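The same filtering can be done natively in pandas with boolean indexing, which is often faster since no SQLite round trip is needed. A minimal sketch, using a small hypothetical frame in place of Master.csv:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the full Master.csv
df_small = pd.DataFrame({
    'playerID': ['a01', 'b01', 'c01'],
    'birthCountry': ['USA', 'CAN', 'USA'],
    'weight': [190, 210, 175],
})

# pandas equivalent of: SELECT playerID, weight FROM df WHERE weight < 200
light = df_small.loc[df_small['weight'] < 200, ['playerID', 'weight']]
print(light)
```

Which style to use is largely a matter of taste; pandasql is convenient when you already think in SQL.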
Another common format is JSON. Data frequently arrives in this format when you are working with web services and their REST APIs. The example below shows how one can retrieve data from a website's REST API and load the JSON response into a Python dictionary. We use the requests library to fetch the data and the json library to parse it. pprint is used to print the JSON in a more readable format.
In [7]:
import json
import requests
import pprint
url = 'http://ws.audioscrobbler.com/2.0/?method=album.getinfo&api_key=4beab33cc6d65b05800d51f5e83bde1b&artist=Cher&album=Believe&format=json'
data = requests.get(url).text
df3 = json.loads(data)
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(df3)
Once the data is in the dictionary, we can access the information as such:
In [8]:
df3['album']['playcount']
Out[8]:
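The same nested lookups work on any parsed JSON, so you can experiment without hitting the network. The sketch below parses an inline string whose shape loosely imitates the album response above (the values are invented for illustration):

```python
import json

# Inline JSON roughly imitating the shape of the Last.fm album response
raw = '{"album": {"name": "Believe", "artist": "Cher", "playcount": "123456"}}'
album = json.loads(raw)

# json.loads turns JSON objects into nested dicts, so access chains like
# df3['album']['playcount'] are just ordinary dictionary indexing.
print(album['album']['playcount'])
```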
Returning to our baseball DataFrame, we can see that some entries are missing. There is a function called fillna that can be used to insert a value for the NaN entries. Here we insert a -1 for all NaN entries in the deathMonth column.
In [9]:
print(df['deathMonth'].head())
In [10]:
df['deathMonth'].fillna(-1).head()
Out[10]:
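Filling with a sentinel is one option; dropping incomplete rows with dropna is another. A small sketch on a made-up Series (the values here are illustrative, not from Master.csv) contrasts the two:

```python
import numpy as np
import pandas as pd

# Illustrative series with gaps, like the deathMonth column
s = pd.Series([4.0, np.nan, 11.0, np.nan])

filled = s.fillna(-1)   # keep every row, mark the gaps with a sentinel
dropped = s.dropna()    # or discard the incomplete rows entirely

print(filled.tolist())
print(len(dropped))
```

fillna preserves the row count (useful when rows must stay aligned with other columns), while dropna shrinks the dataset but leaves only genuine observations. Which is appropriate depends on the analysis.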